Wine Quality "Warm up" Challenge

Physicochemical factors that predict good quality wine


The "warm up" challenge for this year is adapted from the well-known 'Wine Quality' challenge on Kaggle. In particular, given a dataset containing several attributes describing wine, your task is to make predictions on the quality of as-yet unlisted wine samples. Developing a model which accurately fits the available training data while also generalising to unseen data-points is a multi-faceted challenge that involves a mixture of data exploration, pre-processing, model selection, and performance evaluation.

IMPORTANT: please refer to the AML course guidelines concerning grading rules. Pay special attention to the presentation-quality item, which boils down to: don't dump a zillion lines of code and plots in this notebook. Produce a concise summary of your findings: this notebook can exist in two versions, a "scratch" version that you use to work and debug, and a "presentation" version that you submit. The "presentation" notebook should get to the point and convey the main findings of your work.


Overview

Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist. In this regard, your notebook should be structured in such a way as to explore the following tasks, which are expected to be carried out whenever undertaking such a project. The description below each aspect should serve as a guide for your work, but you can also explore alternative options and directions. Thinking outside the box will be rewarded in these challenges.

Dataset description

You will be working on two data files, which will be available in /mnt/datasets/wine/, one for red and one for white wines:

  • winequality-red.csv
  • winequality-white.csv

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference Cortez et al., 2009. Only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

Tips

A possible trick is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. samples scoring 7 or higher get classified as 'good/1' and the remainder as 'not good/0'. Note that this can be seen as a data preparation task.
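The binarisation described above can be sketched in a couple of lines; the quality scores below are illustrative, not drawn from the real dataset.

```python
import pandas as pd

# Illustrative quality scores (synthetic, standing in for the 'quality' column).
quality = pd.Series([4, 5, 6, 7, 8, 3])

# Cutoff at 7: scores >= 7 become 'good/1', the rest 'not good/0'.
good = (quality >= 7).astype(int)
print(good.tolist())  # [0, 0, 0, 1, 1, 0]
```

With the real data, the same expression applied to `red['quality']` would yield the binary target.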

Training and test sets

We leave it to the students to decide how to carve out training and test sets (validation sets too, if relevant to your approach). This is not a competition in which the instructors hold a "private" test set to rank students' models.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Attributes

Input variables (based on physicochemical tests):

  • 1 - fixed acidity
  • 2 - volatile acidity
  • 3 - citric acid
  • 4 - residual sugar
  • 5 - chlorides
  • 6 - free sulfur dioxide
  • 7 - total sulfur dioxide
  • 8 - density
  • 9 - pH
  • 10 - sulphates
  • 11 - alcohol

Output variable (based on sensory data):

  • 12 - quality (score between 0 and 10)

In [0]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR, SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split, GridSearchCV
In [64]:
from google.colab import files
uploaded = files.upload()
Saving winequality-red.csv to winequality-red (1).csv
Saving winequality-white.csv to winequality-white (1).csv
In [65]:
red = pd.read_csv("winequality-red.csv", sep=';')
white = pd.read_csv("winequality-white.csv", sep=';')

print("Import complete")
Import complete

1. Data preparation:

Data exploration: The first broad component of your work should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification. Among others, you can work on:

  • Data cleaning, e.g. treatment of categorical variables;
  • Data visualisation;
  • Computing descriptive statistics, e.g. correlation.
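As a concrete example of a descriptive statistic, the correlation of each attribute with quality can be ranked in one line. The sketch below uses a tiny synthetic frame standing in for the real data; on the actual dataset the same call would be `red.corr()['quality']`.

```python
import pandas as pd

# Tiny synthetic frame (illustrative values, not the real red-wine data).
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.0, 12.0],
    "density": [0.998, 0.997, 0.996, 0.995, 0.994],
    "quality": [5, 5, 6, 6, 7],
})

# Pearson correlation of every attribute with quality, strongest first.
corr = df.corr()["quality"].drop("quality").sort_values(ascending=False)
print(corr)
```

On this toy frame alcohol correlates positively and density negatively with quality, mirroring what the pairplots below hint at.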
In [0]:
#red wine
red.head() #we notice that all data are numerical
Out[0]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [0]:
red.describe() #to have some statistics
Out[0]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000
In [0]:
red.info() #we notice that there is no empty value
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
In [0]:
sns.pairplot(red)
Out[0]:
<seaborn.axisgrid.PairGrid at 0x7f1ba5b4f9e8>

The pairplot shows no strikingly strong correlation between any single attribute and quality, though alcohol and density look like the most promising candidates.

In [0]:
# let's do the same with white wine
white.head() #we notice that all data are numerical
Out[0]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
In [0]:
white.describe() #to have some statistics
Out[0]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267 5.877909
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621 0.885639
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000 5.000000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000 6.000000
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000 6.000000
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000 9.000000
In [0]:
white.info() #we notice that there is no empty value
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
In [0]:
sns.pairplot(white)
Out[0]:
<seaborn.axisgrid.PairGrid at 0x7fdce2d5e5f8>

Data Pre-processing: The previous step should give you a better understanding of which pre-processing is required for the data. This may include:

  • Normalising and standardising the given data;
  • Removing outliers;
  • Carrying out feature selection, possibly using metrics derived from information theory;
  • Handling missing information in the dataset;
  • Augmenting the dataset with external information;
  • Combining existing features.

Note that, as the name implies, this is a warm-up challenge, which essentially means that data is already put in a convenient format that requires minimal pre-processing.
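One caveat worth flagging: the cells below fit the scaler on the full dataset before the train/test split, which leaks test-set statistics into training. A safer pattern is to compute the min and max on the training split only and apply the same transform to the test split. A minimal numpy sketch of that idea, on synthetic values:

```python
import numpy as np

# Synthetic 1-D feature split into train and test (illustrative values).
train = np.array([8.0, 9.5, 10.0, 11.0, 15.0])
test = np.array([9.0, 16.5])

# Fit min/max on the training split only...
lo, hi = train.min(), train.max()
# ...then apply the same affine transform to both splits.
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
print(test_scaled)  # test values may legitimately fall outside [0, 1]
```

With scikit-learn, the equivalent is calling `fit_transform` on the training split and plain `transform` on the test split.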

In [0]:
# Let's normalize

r = red.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
r_scaled = min_max_scaler.fit_transform(r)
red_normalized = pd.DataFrame(r_scaled, columns=red.columns, index=red.index)
#red_normalized.head()

w = white.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
w_scaled = min_max_scaler.fit_transform(w)
white_normalized = pd.DataFrame(w_scaled, columns=white.columns, index=white.index)
#white_normalized.head()
In [0]:
red_y = red['quality']
red_X = red_normalized.drop(['quality'], axis=1)

white_y = white['quality']
white_X = white_normalized.drop(['quality'], axis=1)

Feature selection does not seem necessary here, nor does outlier removal: the data are clean and there are only eleven input features.

Combining features could be worth considering, but with so few features we keep them as they are.

2. Model selection

An important part of the work involves the selection of a model that can successfully handle the given data and yield sensible predictions. Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models. There is a wealth of models to choose from, such as decision trees, random forests, (Bayesian) neural networks, Gaussian processes, LASSO regression, and so on.

Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning. There are several techniques for carrying out such a procedure, such as cross-validation.

Why these models?

We first chose to use regression models, since the plot below seems to indicate a gradient in the quality that looks almost linear:

In [0]:
plt.scatter(red_X.to_numpy()[:, 1], red_X.to_numpy()[:, 10], marker='o', c=red_y, s=25, alpha=0.8)
Out[0]:
<matplotlib.collections.PathCollection at 0x7f32350b97f0>

This is why, even though it looks like a classification problem, regression models can nonetheless be relevant.

Class distribution:

In [91]:
values, count = np.unique(red_y, return_counts=True)
print(values, count)
[3 4 5 6 7 8] [ 10  53 681 638 199  18]
In [92]:
values, count = np.unique(white_y, return_counts=True)
print(values, count)
[3 4 5 6 7 8 9] [  20  163 1457 2198  880  175    5]
In [0]:
# Separation of test set and training set
Xr_train, Xr_test, yr_train, yr_test = train_test_split(red_X, red_y, test_size=0.2, random_state=0)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(white_X, white_y, test_size=0.2, random_state=0)

Since MAE or MSE scores cannot be compared directly between classifiers and regressors, we cross-validate on the predictive accuracy of each model.

The plot just above makes us think that a simple linear regression would not perform too badly, since we observe a somewhat smooth gradient of the quality over some features (corresponding to the colour gradient in the plotted figure). A linear regression model will thus be a starting point against which the more complex models that follow can be evaluated.
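To put regressors and classifiers on a common accuracy scale, the regressors' continuous outputs are rounded to the nearest integer quality class (and can additionally be clipped to the observed label range, 3–8 for the red wines). A small sketch on hypothetical predictions:

```python
import numpy as np

# Continuous predictions from a hypothetical regressor (illustrative values).
y_cont = np.array([4.7, 5.2, 6.49, 6.51, 8.9])

# Round to the nearest quality class and clip to the red-wine label range [3, 8].
y_class = np.clip(np.around(y_cont), 3, 8).astype(np.int64)
print(y_class)  # [5 5 6 7 8]
```

This is exactly what the `np.around(...).astype(np.int64)` wrapper in the regression cells below does (without the clipping step).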

Regressive models

In [0]:
### Linear Regression
model = LinearRegression()

# Cross-validation predictions give a better estimate of how well the model generalises;
# the regressor's continuous outputs are rounded to the nearest integer quality class.
ypred = np.around(cross_val_predict(model, Xr_train, yr_train.ravel(), cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train.ravel(), ypred))
print("confusion matrix =\n", confusion_matrix(yr_train.ravel(), ypred))
cross validation accuracy = 0.5777951524628616
confusion matrix =
 [[  0   1   6   1   0   0]
 [  0   1  26  15   0   0]
 [  0   3 378 162   2   1]
 [  0   0 151 319  26   0]
 [  0   0   6 125  41   0]
 [  0   0   0  10   5   0]]

Linear regression reaches a cross-validation accuracy of about 0.58, which serves as our baseline. A random forest regressor, able to capture non-linear interactions between features, is a natural next candidate.

In [0]:
### RANDOM FOREST
model = RandomForestRegressor(verbose = 0)

# Cross-validation predictions give a better estimate of how well the model generalises;
# the regressor's continuous outputs are rounded to the nearest integer quality class.
ypred = np.around(cross_val_predict(model, Xr_train, yr_train.ravel(), cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train.ravel(), ypred))
print("confusion matrix =\n", confusion_matrix(yr_train.ravel(), ypred))
cross validation accuracy = 0.6450351837372947
confusion matrix =
 [[  0   3   4   1   0   0]
 [  0   3  28  10   1   0]
 [  0   2 411 129   4   0]
 [  0   0 126 337  33   0]
 [  0   0   5  93  74   0]
 [  0   0   0   6   9   0]]

The random forest regressor improves the baseline to about 0.65. Gradient boosting, another tree ensemble that fits trees sequentially on the residuals, is tried next.

In [0]:
### GRADIENT BOOSTING REGRESSOR
model = GradientBoostingRegressor(verbose = 0)

# Cross-validation predictions give a better estimate of how well the model generalises;
# the regressor's continuous outputs are rounded to the nearest integer quality class.
ypred = np.around(cross_val_predict(model, Xr_train, yr_train.ravel(), cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train.ravel(), ypred))
print("confusion matrix =\n", confusion_matrix(yr_train.ravel(), ypred))
cross validation accuracy = 0.6043784206411259
confusion matrix =
 [[  0   2   6   0   0   0]
 [  0   2  26  14   0   0]
 [  0   5 402 134   5   0]
 [  0   0 154 305  37   0]
 [  0   0   5 103  64   0]
 [  0   0   0  12   3   0]]

Gradient boosting lands around 0.60 here, below the random forest. Finally, we try support vector regression as a non-tree-based alternative.

In [0]:
### SVM REGRESSOR
model = SVR()

# Cross-validation predictions give a better estimate of how well the model generalises;
# the regressor's continuous outputs are rounded to the nearest integer quality class.
ypred = np.around(cross_val_predict(model, Xr_train, yr_train.ravel(), cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train.ravel(), ypred))
print("confusion matrix =\n", confusion_matrix(yr_train.ravel(), ypred))
cross validation accuracy = 0.6051602814698983
confusion matrix =
 [[  0   0   8   0   0   0]
 [  0   0  33   8   1   0]
 [  0   0 413 130   3   0]
 [  0   0 163 300  33   0]
 [  0   0   9 102  61   0]
 [  0   0   0  11   4   0]]

Classification models

Regression models may not be as well suited as we first thought, so we also evaluate classification models.

First, logistic regression, probably the simplest classifier, plays the same baseline role that linear regression did for the regressors.

In [0]:
### Logistic Regression
model = LogisticRegression()

# Cross-validation predictions give a better estimate of how well the model generalises.
ypred = cross_val_predict(model, Xr_train, yr_train.ravel(), cv=5)
print("cross validation accuracy =", accuracy_score(yr_train.ravel(), ypred))
print("confusion matrix =\n", confusion_matrix(yr_train.ravel(), ypred))
cross validation accuracy = 0.5785770132916341
confusion matrix =
 [[  0   0   6   2   0   0]
 [  0   0  29  13   0   0]
 [  0   0 417 127   2   0]
 [  0   0 182 291  23   0]
 [  0   0  11 129  32   0]
 [  0   0   0   9   6   0]]

Logistic regression performs on par with linear regression (about 0.58). As on the regression side, a random forest is the natural tree-ensemble candidate.

In [0]:
### Random forest classification
model = RandomForestClassifier()

# Cross-validation predictions give a better estimate of how well the model generalises.
ypred = cross_val_predict(model, Xr_train, yr_train.ravel(), cv=5)
print("cross validation accuracy =", accuracy_score(yr_train.ravel(), ypred))
print("confusion matrix =\n", confusion_matrix(yr_train.ravel(), ypred))
cross validation accuracy = 0.6653635652853792
confusion matrix =
 [[  0   2   6   0   0   0]
 [  1   1  27  13   0   0]
 [  0   2 436 101   7   0]
 [  0   0 135 327  34   0]
 [  0   0   7  80  85   0]
 [  0   0   0   7   6   2]]

The random forest classifier gives the best accuracy so far (about 0.67). For completeness, we also evaluate a support vector classifier.

In [0]:
### SVM CLASSIFIER
model = SVC()

# Cross-validation predictions give a better estimate of how well the model generalises.
ypred = cross_val_predict(model, Xr_train, yr_train.ravel(), cv=5)
print("cross validation accuracy =", accuracy_score(yr_train.ravel(), ypred))
print("confusion matrix =\n", confusion_matrix(yr_train.ravel(), ypred))
cross validation accuracy = 0.5840500390930414
confusion matrix =
 [[  0   0   7   1   0   0]
 [  0   0  30  11   1   0]
 [  0   0 410 134   2   0]
 [  0   0 177 294  25   0]
 [  0   0  13 116  43   0]
 [  0   0   0  13   2   0]]

KNN may be a relevant classifier, though "expensive" at prediction time. We also need to estimate the right k, the number of neighbours taken into account.

In [0]:
n_neighbors = np.arange(1, 50, 1)
accs = []
for n_neighbor in n_neighbors:
    model = KNeighborsClassifier(n_neighbors=n_neighbor, weights='distance')
    ypred = cross_val_predict(model, Xr_train, yr_train.ravel(), cv=4)
    accs.append(accuracy_score(ypred, yr_train))
plt.plot(n_neighbors, accs)
plt.xlabel("k (number of neighbors)")
plt.ylabel("cross validation accuracy")
plt.show()

We see that k is optimal around 15 with a cross validation accuracy of 0.65.
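Rather than reading the optimum off the plot, the best k can also be picked programmatically with `np.argmax`; the sketch below uses an illustrative accuracy array, whereas in the notebook the real `accs` comes from the cross-validation loop above.

```python
import numpy as np

# Illustrative cross-validation accuracies for k = 1..5 (not real results).
n_neighbors = np.arange(1, 6)
accs = np.array([0.55, 0.60, 0.64, 0.62, 0.58])

# Index of the highest accuracy maps back to the corresponding k.
best_k = n_neighbors[np.argmax(accs)]
print(best_k)  # 3
```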

Hyperparameter tuning

The best results so far were obtained with the random forest classifier (without any hyperparameter tuning) and with KNN (at the optimal k).

To optimize our forest model, we'll use grid search.

In [0]:
model =  RandomForestClassifier(verbose=0)
model.get_params().keys()
Out[0]:
dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])
In [0]:
parameters = {
    "n_estimators":[10, 15, 20],
    "min_samples_split": np.linspace(0.001, 0.005, 5),
    "min_samples_leaf": np.linspace(0.0001, 0.001, 5),
    "max_depth":[12, 8, 4],
    "max_features":["auto", "log2", "sqrt"],
    "criterion": ["gini",  "entropy"],
    }

grid_search = GridSearchCV(model, parameters, verbose = 1, n_jobs=-1)

grid_search.fit(Xr_train, yr_train.values.ravel())
print(grid_search.score(Xr_train, yr_train.values.ravel()))
print(grid_search.best_params_)
Fitting 5 folds for each of 1350 candidates, totalling 6750 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 164 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done 764 tasks      | elapsed:   26.1s
[Parallel(n_jobs=-1)]: Done 1764 tasks      | elapsed:   58.2s
[Parallel(n_jobs=-1)]: Done 3164 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 4964 tasks      | elapsed:  2.9min
0.962470680218921
{'criterion': 'gini', 'max_depth': 12, 'max_features': 'log2', 'min_samples_leaf': 0.000325, 'min_samples_split': 0.003, 'n_estimators': 20}
[Parallel(n_jobs=-1)]: Done 6750 out of 6750 | elapsed:  4.0min finished
In [70]:
bestModel_r = RandomForestClassifier(n_estimators=20, min_samples_split=0.003, min_samples_leaf=0.000325, max_depth=12, max_features="log2", criterion="gini")
bestModel_r.fit(Xr_train, yr_train.values.ravel())
Out[70]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=12, max_features='log2',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=0.000325, min_samples_split=0.003,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [71]:
print("Accuracy on test set:", accuracy_score(bestModel_r.predict(Xr_test), yr_test))
Accuracy on test set: 0.73125
In [0]:
parameters = {
    "n_estimators":[10, 20],
    "min_samples_split": np.linspace(0.0001, 0.0005, 5),
    "min_samples_leaf": np.linspace(0.000001, 0.00001, 5),
    "max_depth":[12, 8, 4],
    "max_features":["auto", "log2", "sqrt"],
    "criterion": ["gini",  "entropy"],
    }

grid_search = GridSearchCV(model, parameters, verbose = 1, n_jobs=-1)

grid_search.fit(Xw_train, yw_train.values.ravel())
print(grid_search.score(Xw_train, yw_train.values.ravel()))
print(grid_search.best_params_)
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   15.3s
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:   33.4s
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed:   58.3s
[Parallel(n_jobs=-1)]: Done 1246 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 2084 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 3384 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 4500 out of 4500 | elapsed:  5.4min finished
0.9515058703420113
{'criterion': 'entropy', 'max_depth': 12, 'max_features': 'log2', 'min_samples_leaf': 5.5e-06, 'min_samples_split': 0.0002, 'n_estimators': 20}
In [72]:
bestModel_w = RandomForestClassifier(n_estimators=20, min_samples_split=0.0002, min_samples_leaf=5.5e-06, max_depth=12, max_features="log2", criterion="entropy")
bestModel_w.fit(Xw_train, yw_train.values.ravel())
Out[72]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=12, max_features='log2',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5.5e-06, min_samples_split=0.0002,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [73]:
print("Accuracy on test set:", accuracy_score(bestModel_w.predict(Xw_test), yw_test))
Accuracy on test set: 0.6295918367346939

3. Performance evaluation

The evaluation metric for this project is "Log Loss". For the N wines in the test data set, the metric is calculated as:

$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]$

where $y_i$ is the true (but withheld) quality outcome for wine $i$ in the test data set, and $p_i$ is the predicted probability of good quality for wine $i$. Larger values of $\mathcal{L}$ indicate poorer predictions.
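Assuming the quality labels are first binarised as in the tip above, the metric can be computed directly from a classifier's predicted probabilities. A minimal numpy sketch with synthetic labels and probabilities (probabilities are clipped to avoid log(0)):

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Negative mean log-likelihood of binary labels y under
    predicted positive-class probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic binary labels and predicted probabilities (illustrative only).
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3])
print(round(log_loss(y, p), 4))  # about 0.198
```

Confident correct predictions keep the loss near zero, while confident wrong predictions blow it up; `sklearn.metrics.log_loss` implements the same quantity.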

In [0]:
yr_predict = bestModel_r.predict(Xr_test)
yw_predict = bestModel_w.predict(Xw_test) 
In [0]:
def accuracy(y, y_hat):
  return np.sum(np.abs(y - y_hat) < 0.5) / y.size

def mae(y, y_hat):
  return np.mean(np.abs(y-y_hat))
def rmse(y, y_hat):
  return np.sqrt(np.mean((y-y_hat)**2))

Red wine classification

In [86]:
print("accuracy =", accuracy_score(yr_test.ravel(), yr_predict))
print("mae =", mae(yr_test.ravel().astype(np.float64), yr_predict))
print("rmse =", rmse(yr_test.ravel().astype(np.float64), yr_predict))
accuracy = 0.73125
mae = 0.309375
rmse = 0.6349212549600147

White wine classification

In [87]:
print("accuracy =", accuracy_score(yw_test.ravel(), yw_predict))
print("mae =", mae(yw_test.ravel().astype(np.float64), yw_predict))
print("rmse =", rmse(yw_test.ravel().astype(np.float64), yw_predict))
accuracy = 0.6295918367346939
mae = 0.4357142857142857
rmse = 0.7646527823470077

MAE < RMSE here because every misclassification is off by at least one quality class: with integer errors of magnitude ≥ 1, squaring never shrinks an error, so the mean squared error (and hence RMSE) is pulled up relative to MAE.
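A tiny worked example of the MAE/RMSE gap, on illustrative absolute errors (three exact hits, two off-by-one, one off-by-two):

```python
import numpy as np

# Illustrative absolute prediction errors in quality classes.
errors = np.array([0, 0, 0, 1, 1, 2])

mae = np.mean(errors)                  # 4/6 ~ 0.667
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt(6/6) = 1.0
print(mae, rmse)
```

The single off-by-two error contributes four times as much to the squared mean as an off-by-one error, which is why RMSE exceeds MAE whenever the errors are not all equal.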
